Red Wine Quality
Data Analysis Project
Data analysis tools: Python, R
Helper tools: Quarto, RStudio, Git
Skills:
- statistical programming,
- data pre-processing,
- descriptive statistics and data visualization, EDA,
- inferential statistics: hypothesis testing and confidence intervals,
- statistical modelling: linear and logistic regression,
- literate programming.
Abbreviations
- CI – 95% confidence interval.
- CV – Cross-validation.
- Dim – dimension (principal component, the same as PC).
- n – either sample or group size.
- p – p-value.
- p_adj – p-value (adjusted).
- PCA – principal component analysis.
- PC – principal component.
- r – correlation coefficient.
- R² – r squared, coefficient of determination.
- RF – random forest.
- RNG – (pseudo)random number generator.
1 Introduction
Wine quality assessment and certification are important for the wine-making and selling processes. Certification helps to prevent counterfeiting, and in this way protects people’s health and assures quality for the wine market. Identifying the most influential factors is crucial for effective quality evaluation, while classifying wines into quality groups is useful for setting prices (Cortez et al. 2009).
Vinho verde is wine from the Minho region, located in the northwest of Portugal. In this project, data on red vinho verde wine specimens are investigated. The dataset was originally provided by Cortez et al. (2009). This project mainly focuses on:
- prediction (to investigate how accurately (1) wine quality and (2) alcohol content in wine can be predicted) and
- explanation (to identify the most important factors for the prediction tasks).
1.1 The Dataset
Various aspects of the dataset are described in various sources. Pieces of the description from the article by Cortez et al. (2009) as well as from websites (“Red Wine Quality — Kaggle.com”; “Wine Quality Datasets — www3.dsi.uminho.pt”) were copied, merged, and presented in this sub-section.
Variables based on physicochemical tests:
- fixed acidity (\(g_{\textrm{ tartaric acid}}/l\)): most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
- volatile acidity (\(g_{\textrm{ acetic acid}}/l\)): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
- citric acid (\(g/l\)): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
- residual sugar (\(g/l\)): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
- chlorides (\(g_{\textrm{ sodium chloride}}/l\)): the amount of salt in the wine.
- free sulfur dioxide (\(mg/l\)): the free form of SO₂ exists in equilibrium between molecular SO₂ (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
- total sulfur dioxide (\(mg/l\)): amount of free and bound forms of SO₂; in low concentrations, SO₂ is mostly undetectable in wine, but at free SO₂ concentrations over 50 ppm, SO₂ becomes evident in the nose and taste of wine.
- density (\(g/cm^3\)): the density of wine is close to that of water, depending on the percent alcohol and sugar content.
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO₂) levels, which acts as an antimicrobial and antioxidant.
- alcohol (% vol.): the percent alcohol content of the wine.
Variable based on sensory data:
- quality (score between 0 and 10).
2 Methods
2.1 Additional Technical Tasks
RStudio is an IDE for R and Python. It provides an interface between R and Python for interactive analysis, so the technical tasks were to:
- Try RStudio capabilities for Python programming (documentation, data analysis, plotting, etc.).
- Try RStudio interface between R and Python: use these 2 tools in the same file.
knitr Engine
Quarto is a full-featured system for technical and scientific publishing based on Pandoc. It works with the Jupyter (using “Python”, “IR”, or other kernels) and knitr (multilingual) engines and can be integrated with Jupyter notebooks (.ipynb) or used with plain-text Quarto (.qmd) files.
- Try Quarto capabilities for reproducibility, bi-lingual (R and Python) analysis and technical publishing.
Main principles:
- Python is used for all steps that are directly related to putting the model into production.
- Some exploratory analysis steps and other tasks that are easier to perform in R are done in R.
Before deciding to do a bilingual project, I consulted my batch STL. He encouraged me to do it and, later, to share my insights in the stand-up.
The reasons, pros, and cons related to these technical tasks will be discussed in more detail during the oral presentation.
2.2 Population and Sample
The population of interest is red vinho verde wines. It is assumed that the data is a simple random sample or its equivalent that allows using methods of inferential statistics.
2.3 Training and Test Sets
As the prediction step will be included in the analysis, the whole sample is divided into model training and test sets using an 80:20 train/test hold-out strategy with stratification by wine quality scores.
The training set was used for:
- exploratory analysis;
- statistical inference;
- model creation and selection.
The test set was used only to evaluate the performance of the final models.
2.4 General: Inferential Statistics
For statistical inference, significance level \(\alpha=0.05\) and 95% confidence level are used.
2.5 Differences Between Groups
For statistical inference about several groups (categories), the following strategy is used:
- first, 95% confidence intervals (CI) for each group are calculated,
- second, an omnibus test is performed.
In the analysis of counts (sizes of wine quality groups), the following methods are used:
- confidence intervals: Goodman’s simultaneous confidence intervals of proportions;
- omnibus test: Pearson’s chi-square (χ²) goodness-of-fit (GOF) test, with hypotheses:
\(H_0:\) All proportions are equal: \(~ \pi_1 = \pi_2 = ... = \pi_i = ... = \pi_k\),
\(H_1:\) At least two proportions differ: \(\pi_i \ne \pi_j\) for at least one pair of \(i\) and \(j\).
Here \(i\) and \(j\) are group indices (\(i\) = 1, 2, …, \(k\); \(j\) = 1, 2, …, \(k\); \(i \ne j\)) and \(k\) is the total number of groups.
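As a minimal sketch of the omnibus step (hypothetical counts for the low/medium/high quality groups; Goodman’s intervals are computed elsewhere), Pearson’s χ² GOF test of equal proportions can be run with `scipy.stats.chisquare`:

```python
import scipy.stats as sps

# Hypothetical counts of wines in the low/medium/high quality groups
counts = [63, 1319, 217]

# H0: all group proportions are equal (uniform expected frequencies)
chi2, p = sps.chisquare(counts)
print(f"chi-square = {chi2:.1f}, p = {p:.3g}")  # a small p rejects H0
```

With such unbalanced counts the test rejects \(H_0\) decisively.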
2.6 Correlation Between Variables
For relationship between numeric variables, parametric Pearson’s correlation analysis is used.
Hypotheses:
\(H_0:\) \(\rho = 0\) (variables do not correlate),
\(H_1:\) \(\rho \ne 0\) (variables correlate).
Here \(\rho\) (rho) is the population Pearson correlation coefficient.
In statistical significance testing, as hypotheses are tested multiple times, the Holm correction is applied to prevent an inflated type I error rate.
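A minimal sketch of this workflow on made-up data (the project itself uses `pingouin.pairwise_corr` with `padjust="holm"`): Pearson’s r via `scipy.stats.pearsonr`, then Holm adjustment of a family of p-values via `statsmodels`:

```python
import numpy as np
import scipy.stats as sps
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)  # correlated with x by construction

r, p = sps.pearsonr(x, y)  # tests H0: rho = 0

# Holm correction for a (hypothetical) family of four p-values
pvals = [p, 0.04, 0.30, 0.001]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
```

Adjusted p-values are never smaller than the raw ones, which is what controls the family-wise error rate.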
2.7 Regression Task
To predict alcohol content, linear regression is used. For feature selection, several strategies were applied:
- a strategy based on statistical inference and assumption checking,
- sequential feature selection (SFS) with 5-fold cross-validation.
The main metric for feature selection was R².
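The cross-validated R² criterion itself can be sketched with scikit-learn on synthetic data (the project’s actual selection uses mlxtend’s `SequentialFeatureSelector`, defined in the setup code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 matters

# 5-fold cross-validated R² for a candidate feature subset
scores = cross_val_score(LinearRegression(), X[:, [0]], y, cv=5, scoring="r2")
print(scores.mean())
```

A feature subset with a higher mean cross-validated R² is preferred; the "parsimonious" option additionally favors smaller subsets within one standard error of the best score.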
2.8 Classification Task
To predict wine quality, logistic regression is used. For feature selection, a statistical-inference-based strategy was used.
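A minimal sketch of the classification setup on synthetic data (scikit-learn’s `LogisticRegression` as a stand-in; the project’s actual models are fitted later on the wine training set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for physicochemical features and a binary quality label
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"Hold-out accuracy: {clf.score(X_te, y_te):.2f}")
```

The same fit/score pattern applies when the features are wine measurements and the target is the quality group.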
2.9 Feature Engineering
In some cases, the data were right-skewed. Such variables were investigated both on the original scale (untransformed) and after logarithmic transformation.
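A small sketch of the idea on a hypothetical right-skewed sample: log-transformation pulls the skewness coefficient toward 0:

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0, sigma=1, size=500)  # right-skewed by construction

print(round(sps.skew(x), 2))            # well above 0.5
print(round(sps.skew(np.log10(x)), 2))  # near 0 after the transformation
```

This is why only variables with skewness above the chosen threshold are transformed later in the analysis.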
3 Preparation and Inspection
3.1 Setup
Both R and Python will be used in this project. R code is marked with blue and Python code with yellow stripes.
Code: R setup
# R setup ================================================
# R packages ---------------------------------------------
library(tidyverse)
library(reticulate)
library(factoextra)
library(DescTools)
library(patchwork)
# R options ----------------------------------------------
options(scipen = 5)
DescToolsOptions(stamp = NULL, digits = 3)
# Default ggplot2 theme
theme_set(theme_bw())
# Other options
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 6, fig.align = "center")
# RNG state
set.seed(100)
Code: Python setup
# Python setup =========================================
# Packages and modules ---------------------------------
# Data wrangling, math
import numpy as np
import pandas as pd
import janitor
from numpy import log
# Statistical analysis and modeling
import patsy
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import pingouin as pg
import scikit_posthocs as sp
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Python Settings -----------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))
# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 40) # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.3f}")
pd.set_option("styler.format.thousands", ",")
# RNG state
np.random.seed(100)
Code: Ad-hoc Python functions
# Ad-hoc Python functions
def group_quality_scores(score):
    """Recode a wine quality score as a quality group:
    - score ≤ 4 – low,
    - score 5–6 – medium,
    - score ≥ 7 – high.

    Args:
        score (int, float): numeric value from 0 to 10.
    """
    assert 0 <= score <= 10
    score = int(score)
    if score <= 4:
        return "low"
    elif 5 <= score <= 6:
        return "medium"
    elif 7 <= score:
        return "high"
# Compute the VIF for all given features
def compute_vif(X, sort=True):
    """Compute variance inflation factor (VIF).

    Args:
        X (pandas.DataFrame): features without the target variable.
        sort (bool): True – sort VIF values in descending order.
    Return:
        vif (pandas.DataFrame): data frame with columns "feature" and "vif".
    """
    # Requires statsmodels.stats.outliers_influence.variance_inflation_factor
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    # The calculation of variance inflation REQUIRES a constant term,
    # as the calculations are based on `statsmodels`
    X = X.copy()  # work on a copy so the caller's data frame is not modified
    X["(intercept)"] = 1
    # Create a data frame to store VIF values
    vif = pd.DataFrame()
    vif["feature"] = X.columns
    vif["vif"] = [
        variance_inflation_factor(X.values, i) for i in range(X.shape[1])
    ]
    # Remove the intercept (constant) term
    vif = vif[vif["feature"] != "(intercept)"]
    # Sort
    if sort:
        vif = vif.sort_values("vif", ascending=False)
    # Output
    return vif
def format_p0(p):
    """Format p values to 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "<0.001"
    elif p > 0.999:
        return ">0.999"
    else:
        return f"{p:.3f}"
def get_rf_importance(formula, data, y_1d=True):
    """Get feature importances from a random forest regression model.

    Args:
        formula (str): R-like model formula.
        data (pandas.DataFrame): dataset.
        y_1d (bool): Should the y model matrix be a 1D array?
    Return:
        pandas.DataFrame with columns "terms" and "importance".
    """
    y, X = patsy.dmatrices(formula, data)
    if y_1d:
        y = y.ravel()
    model = RandomForestRegressor().fit(X, y)
    return (
        pd.DataFrame({
            "terms": X.design_info.column_names,
            "importance": model.feature_importances_,
        })
        .sort_values("importance", ascending=False)
    )
def pairwise_correlation(data, method="pearson"):
    """Correlation analysis (including p value calculation).

    Args:
        data (pandas.DataFrame): dataset with numeric variables.
        method (str): correlation method, e.g., "pearson" or "spearman".
    Return:
        pandas.DataFrame with the results of the correlation analysis.
    """
    return (
        pg.pairwise_corr(data, method=method, padjust="holm")
        .sort_values(by="r", key=abs, ascending=False)
        .transform_column("p-corr", format_p0, "p_adj")
        .select_columns(["X", "Y", "r", "CI95%", "p_adj"])
    )
# Functions for regression/classification
def do_sfs_linear_regression(formula, data, forward=True):
    """Do sequential feature selection for linear regression.

    Args:
        formula (str): R-style formula.
        data (pandas.DataFrame): dataset.
        forward (bool): True – forward selection,
                        False – backward selection.
    """
    lin_regression = LinearRegression(fit_intercept=True)
    sfs = SequentialFeatureSelector(
        lin_regression, k_features="parsimonious", forward=forward,
        floating=True, scoring="r2", verbose=0, cv=5
    )
    y, X = patsy.dmatrices(formula + "+ 0", data, return_type="dataframe")
    return sfs.fit(X, y)
def get_sfs_performance(sfs_object, n_features):
    """Return performance measures for a certain number of features.

    Args:
        sfs_object: result of do_sfs_linear_regression().
        n_features (int): number of features.
    """
    metrics = sfs_object.get_metric_dict()[n_features]
    return {
        "n_features": n_features,
        "mean R²": round(metrics["avg_score"], 3),
        "median R²": round(np.median(metrics["cv_scores"]), 3),
        "sd R²": round(metrics["std_dev"], 3),
    }
def show_sfs_results_lin_reg(sfs_object, sub_title=""):
    """Show results of do_sfs_linear_regression().

    Args:
        sfs_object: result of do_sfs_linear_regression().
        sub_title (str): second line of the title.
    """
    sfs_type = "Forward" if sfs_object.forward else "Backward"
    plt.clf()
    fig, ax = plt.subplots(1, 2)
    avg_score = [(i, c["avg_score"]) for i, c in sfs_object.subsets_.items()]
    (
        pd.DataFrame(avg_score, columns=["n_features", "avg_score"])
        .plot.scatter(
            x="n_features",
            y="avg_score",
            xlabel="Number of features included",
            ylabel="Average R²",
            ylim=(0.1, 0.8),
            ax=ax[0],
        )
    )
    cv_scores = {i: c["cv_scores"] for i, c in sfs_object.subsets_.items()}
    (
        pd.DataFrame(cv_scores)
        # .drop([1, 2, 3], axis=1)
        .plot.box(
            xlabel="Number of features included",
            ylabel="R²",
            ylim=(0.1, 0.8),
            ax=ax[1],
        )
    )
    if not sfs_object.forward:
        ax[1].invert_xaxis()
    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV"
        f"\n{sub_title}"
    )
    fig.suptitle(main_title)
    plt.tight_layout()
    plt.show()
    # Print results
    n_selected = f"n = {len(sfs_object.k_feature_names_)}"
    r_sq_selected = f"avg. R² = {sfs_object.k_score_:.3f}"
    print(main_title)
    print("\nSelected variables:")
    print(f"[ {n_selected}, {r_sq_selected} ]\n")
    for i, name in enumerate(sfs_object.k_feature_names_, start=1):
        print(f"{i}. {name}")
3.2 Import Data
The data were provided as a text file.
Preview of data file
A few lines from the text file with data:
R code
read_lines("data/winequality-red.csv", n_max = 6) |> writeLines()
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
Python code
wine_raw = pd.read_csv("data/winequality-red.csv", delimiter=";")
The dataset contains 1599 rows and 12 columns. All variables are numeric. No missing values were found.
Details: inspect wine_raw dataset
The properties of the imported dataset:
Python code
wine_raw.shape
(1599, 12)
R code
head(py$wine_raw)
R code
glimpse(py$wine_raw)
Rows: 1,599
Columns: 12
$ `fixed acidity` <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
$ `volatile acidity` <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
$ `citric acid` <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
$ `residual sugar` <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
$ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
$ `free sulfur dioxide` <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
$ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
$ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
$ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
$ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
$ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
$ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5,…
Number of missing values:
Python code
wine_raw.isna().sum()
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
Python code
wine_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
3.3 Pre-Processing and Inspection
3.3.1 Recoding and Renaming Variables
Variable names will be changed to lower snake case (e.g., to make them easier to use in statistical model formulas). Wine quality scores will be merged into quality groups as follows:
- ≤ 4 – low,
- [5-6] – medium,
- ≥ 7 – high.
Python code
wine = wine_raw.clean_names()
wine['quality_gr'] = pd.Categorical(
wine['quality'].apply(group_quality_scores),
categories=["low", "medium", "high"],
ordered=True
)
del wine_raw
How do quality scores and quality groups match?
R code
py$wine |> with(table(quality_gr, quality)) |> addmargins() |> pander::pander()

| quality_gr | 3 | 4 | 5 | 6 | 7 | 8 | Sum |
|---|---|---|---|---|---|---|---|
| low | 10 | 53 | 0 | 0 | 0 | 0 | 63 |
| medium | 0 | 0 | 681 | 638 | 0 | 0 | 1319 |
| high | 0 | 0 | 0 | 0 | 199 | 18 | 217 |
| Sum | 10 | 53 | 681 | 638 | 199 | 18 | 1599 |
Variable recoding was performed without errors as numbers add up correctly.
Details: inspect wine dataset
Column names:
R code
colnames(py$wine)
 [1] "fixed_acidity" "volatile_acidity" "citric_acid"
[4] "residual_sugar" "chlorides" "free_sulfur_dioxide"
[7] "total_sulfur_dioxide" "density" "ph"
[10] "sulphates" "alcohol" "quality"
[13] "quality_gr"
Dataset:
R code
head(py$wine)
R code
glimpse(py$wine)
Rows: 1,599
Columns: 13
$ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
$ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
$ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
$ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
$ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
$ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
$ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
$ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
$ ph <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
$ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
$ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
$ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
$ quality_gr <ord> medium, medium, medium, medium, medium, medium, m…
3.3.2 Split into Training/Test Sets
In this project, to get a more reliable estimate of the models’ predictive performance, an 80:20 train/test hold-out split with stratification by wine quality scores will be used.
Python code
wine_train, wine_test = train_test_split(
wine,
test_size=0.2,
random_state=50,
stratify=wine['quality']
)
del wine
The sizes of the training and test sets after splitting are 1279:320 wine samples. If needed, find more details below.
Details: inspect data after splitting to train/test sets
The dimensions of training set:
Python code
wine_train.shape
(1279, 13)
The dimensions of test set:
Python code
wine_test.shape
(320, 13)
Is the split really stratified? Yes, as the sample size ratios (column ratio) in all strata are around 0.80:
R code
full_join(
py$wine_train |> count(quality, name = "n_train"),
py$wine_test |> count(quality, name = "n_test"),
by = "quality"
) |>
mutate(ratio = round(n_train / (n_train + n_test), digits = 2))
Dataset (train only):
R code
head(py$wine_train)
R code
glimpse(py$wine_train)
Rows: 1,279
Columns: 13
$ fixed_acidity <dbl> 6.7, 8.6, 10.9, 7.7, 7.4, 10.3, 10.0, 9.0, 9.0, 8…
$ volatile_acidity <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.490, …
$ citric_acid <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.25, 0…
$ residual_sugar <dbl> 2.1, 2.0, 4.6, 2.1, 2.4, 2.0, 11.0, 2.0, 2.8, 2.0…
$ chlorides <dbl> 0.064, 0.110, 0.118, 0.102, 0.090, 0.069, 0.071, …
$ free_sulfur_dioxide <dbl> 18, 19, 10, 11, 11, 21, 13, 8, 28, 11, 9, 40, 32,…
$ total_sulfur_dioxide <dbl> 34, 133, 17, 32, 37, 51, 50, 21, 104, 27, 22, 83,…
$ density <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, 0.99…
$ ph <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.27, 3…
$ sulphates <dbl> 0.64, 1.98, 0.56, 0.48, 0.76, 0.72, 0.69, 0.72, 0…
$ alcohol <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000, 11…
$ quality <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, 4, 6…
$ quality_gr <ord> medium, medium, medium, medium, medium, medium, m…
To avoid additional bias, the test set will not be inspected until the final prediction models are created.
From here on, exploratory and inferential analysis as well as statistical modelling will use the training set only.
3.3.3 Further Inspection of Training Set
Graphical and numeric inspection of each variable (find the details below) revealed that some variables are either moderately (skewness > 0.5) or highly (skewness > 1) right-skewed. As linear models are planned in this project, skewed data may lead to non-linearities. This issue will be addressed in the next subsection.
Details: plots and summaries of each variable on original scale
R code
DescTools::Desc(
py$wine_train |> mutate(quality = ordered(quality)),
verbose = 3
)
------------------------------------------------------------------------------
Describe mutate(py$wine_train, quality = ordered(quality)) (data.frame):
data frame: 1279 obs. of 13 variables
1279 complete cases (100.0%)
Nr ColName Class NAs Levels
1 fixed_acidity numeric .
2 volatile_acidity numeric .
3 citric_acid numeric .
4 residual_sugar numeric .
5 chlorides numeric .
6 free_sulfur_dioxide numeric .
7 total_sulfur_dioxide numeric .
8 density numeric .
9 ph numeric .
10 sulphates numeric .
11 alcohol numeric .
12 quality ordered, factor . (6): 1-3, 2-4, 3-5, 4-6,
5-7, ...
13 quality_gr ordered, factor . (3): 1-low, 2-medium,
3-high
------------------------------------------------------------------------------
1 - fixed_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 94 0 8.30 8.21
100.0% 0.0% 0.0% 8.40
.05 .10 .25 median .75 .90 .95
6.10 6.50 7.10 7.90 9.20 10.60 11.71
range sd vcoef mad IQR skew kurt
11.30 1.75 0.21 1.48 2.10 1.02 1.26
lowest : 4.6, 4.9, 5.0 (4), 5.1 (3), 5.2 (6)
highest: 14.3, 15.0 (2), 15.5, 15.6 (2), 15.9
' 95%-CI (classic)
------------------------------------------------------------------------------
2 - volatile_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 0.53 0.52
100.0% 0.0% 0.0% 0.54
.05 .10 .25 median .75 .90 .95
0.27 0.31 0.40 0.52 0.64 0.76 0.85
range sd vcoef mad IQR skew kurt
1.46 0.18 0.34 0.18 0.24 0.71 1.32
lowest : 0.12 (3), 0.16 (2), 0.18 (8), 0.19, 0.2 (2)
highest: 1.18, 1.18, 1.24, 1.33 (2), 1.58
' 95%-CI (classic)
------------------------------------------------------------------------------
3 - citric_acid (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 78 110 0.27 0.26
100.0% 0.0% 8.6% 0.28
.05 .10 .25 median .75 .90 .95
0.00 0.01 0.09 0.25 0.42 0.53 0.60
range sd vcoef mad IQR skew kurt
0.79 0.19 0.73 0.25 0.33 0.32 -0.88
lowest : 0.0 (110), 0.01 (27), 0.02 (40), 0.03 (21), 0.04 (25)
highest: 0.74 (4), 0.75, 0.76 (2), 0.78, 0.79
heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)
' 95%-CI (classic)
------------------------------------------------------------------------------
4 - residual_sugar (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 85 0 2.55 2.47
100.0% 0.0% 0.0% 2.63
.05 .10 .25 median .75 .90 .95
1.50 1.70 1.90 2.20 2.60 3.61 5.20
range sd vcoef mad IQR skew kurt
14.50 1.44 0.57 0.44 0.70 4.42 26.74
lowest : 0.9, 1.2 (7), 1.3 (5), 1.4 (33), 1.5 (22)
highest: 12.9, 13.4, 13.8 (2), 13.9, 15.4 (2)
heap(?): remarkable frequency (9.9%) for the mode(s) (= 2)
' 95%-CI (classic)
------------------------------------------------------------------------------
5 - chlorides (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 0.0867 0.0843
100.0% 0.0% 0.0% 0.0891
.05 .10 .25 median .75 .90 .95
0.0540 0.0600 0.0700 0.0790 0.0900 0.1062 0.1242
range sd vcoef mad IQR skew kurt
0.4520 0.0437 0.5045 0.0148 0.0200 5.0275 31.2563
lowest : 0.012 (2), 0.034, 0.038 (2), 0.039 (4), 0.041 (4)
highest: 0.413, 0.414 (2), 0.415 (2), 0.422, 0.464
' 95%-CI (classic)
------------------------------------------------------------------------------
6 - free_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 58 0 15.70 15.12
100.0% 0.0% 0.0% 16.29
.05 .10 .25 median .75 .90 .95
4.00 5.00 7.00 13.00 21.00 30.00 35.00
range sd vcoef mad IQR skew kurt
71.00 10.60 0.68 10.38 14.00 1.31 2.24
lowest : 1.0 (2), 3.0 (44), 4.0 (34), 5.0 (86), 5.5
highest: 54.0, 55.0 (2), 66.0, 68.0 (2), 72.0
heap(?): remarkable frequency (9.1%) for the mode(s) (= 6)
' 95%-CI (classic)
------------------------------------------------------------------------------
7 - total_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 140 0 45.67 43.85
100.0% 0.0% 0.0% 47.49
.05 .10 .25 median .75 .90 .95
11.00 14.00 21.00 37.00 61.00 94.00 112.00
range sd vcoef mad IQR skew kurt
283.00 33.16 0.73 26.69 40.00 1.61 4.46
lowest : 6.0 (3), 7.0 (4), 8.0 (10), 9.0 (13), 10.0 (24)
highest: 155.0, 160.0, 165.0, 278.0, 289.0
' 95%-CI (classic)
------------------------------------------------------------------------------
8 - density (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 400 0 0.996736 0.996633
100.0% 0.0% 0.0% 0.996839
.05 .10 .25 median .75 .90 .95
0.993598 0.994480 0.995600 0.996720 0.997825 0.999108 1.000000
range sd vcoef mad IQR skew kurt
0.013620 0.001883 0.001889 0.001661 0.002225 0.127046 0.867452
lowest : 0.99007, 0.99064 (2), 0.99084, 0.9912, 0.9915
highest: 1.0026, 1.00289, 1.00315 (2), 1.0032, 1.00369 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
9 - ph (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 86 0 3.31 3.30
100.0% 0.0% 0.0% 3.32
.05 .10 .25 median .75 .90 .95
3.07 3.13 3.21 3.31 3.40 3.51 3.57
range sd vcoef mad IQR skew kurt
1.15 0.15 0.05 0.15 0.19 0.25 0.88
lowest : 2.86, 2.87, 2.88 (2), 2.89 (2), 2.9
highest: 3.72 (3), 3.75, 3.78 (2), 3.9 (2), 4.01 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
10 - sulphates (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 88 0 0.653 0.644
100.0% 0.0% 0.0% 0.662
.05 .10 .25 median .75 .90 .95
0.470 0.490 0.550 0.620 0.730 0.840 0.921
range sd vcoef mad IQR skew kurt
1.650 0.163 0.250 0.119 0.180 2.263 10.489
lowest : 0.33, 0.37, 0.39 (5), 0.4 (4), 0.42 (4)
highest: 1.59, 1.61, 1.62, 1.95, 1.98
' 95%-CI (classic)
------------------------------------------------------------------------------
11 - alcohol (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 61 0 10.42 10.36
100.0% 0.0% 0.0% 10.48
.05 .10 .25 median .75 .90 .95
9.20 9.30 9.50 10.20 11.10 12.00 12.50
range sd vcoef mad IQR skew kurt
6.50 1.06 0.10 1.04 1.60 0.87 0.23
lowest : 8.4 (2), 8.5, 8.7, 8.8 (2), 9.0 (22)
highest: 13.4 (2), 13.5, 13.6, 14.0 (6), 14.9
heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)
' 95%-CI (classic)
------------------------------------------------------------------------------
12 - quality (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 6 6 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 3 8 0.6% 8 0.6%
2 4 42 3.3% 50 3.9%
3 5 545 42.6% 595 46.5%
4 6 510 39.9% 1'105 86.4%
5 7 159 12.4% 1'264 98.8%
6 8 15 1.2% 1'279 100.0%
------------------------------------------------------------------------------
13 - quality_gr (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 3 3 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 low 50 3.9% 50 3.9%
2 medium 1'055 82.5% 1'105 86.4%
3 high 174 13.6% 1'279 100.0%
3.3.4 Log-Transformation of Right-Skewed Variables
The amount of skewness in the variables was investigated before (column skewness_original in the table below) and after (skewness_after_log) log-transformation, which was applied to variables with skewness above 0.5.
Python code
skewness = (
wine_train
.skew(numeric_only=True)
.sort_values(ascending=False)
.rename("skewness_original")
)
skewed_vars = skewness.where(skewness > 0.5).dropna().index.tolist()
skewness_after = (
wine_train
.transform_columns(skewed_vars, np.log10)
.skew(numeric_only=True)
.rename("skewness_after_log")
)
skew_diff = (skewness_after - skewness).rename("difference")
pd.concat(
[skewness, skewness_after, skew_diff], axis=1
).reset_index(names="variable")
variable skewness_original skewness_after_log difference
0 chlorides 5.039 1.530 -3.509
1 residual_sugar 4.435 1.803 -2.632
2 sulphates 2.268 0.852 -1.416
3 total_sulfur_dioxide 1.616 -0.033 -1.649
4 free_sulfur_dioxide 1.313 -0.172 -1.485
5 fixed_acidity 1.025 0.426 -0.599
6 alcohol 0.870 0.669 -0.201
7 volatile_acidity 0.710 -0.434 -1.144
8 citric_acid 0.323 0.323 0.000
9 ph 0.254 0.254 0.000
10 quality 0.226 0.226 0.000
11 density 0.127 0.127 0.000
Guidelines to evaluate degree of skewness (coefficient of asymmetry)
I use these guidelines to interpret skewness coefficient.
- Interpret the direction of skewness:
- \(< 0\): left-skewed data;
- \(0\): ideally symmetric data;
- \(> 0\): right-skewed data.
- Interpret the amount of skewness (rule of thumb):
- \(0\): ideally symmetric;
- \(-0.5\) – \(+0.5\): almost symmetric;
- \(-1\) – \(-0.5\) and \(+0.5\) – \(+1\): moderate asymmetry;
- \(< -1\) or \(> +1\): high asymmetry.
In all cases, the amount of skewness was reduced. Only for alcohol did the coefficient change little, and volatile_acidity turned from moderately right-skewed into slightly left-skewed.
Apply the transformation to the dataset:
Python code
# Create a copy of data frame that contains log-transformed variables
new_names_map = {i:"log_" + str(i) for i in skewed_vars}
wine_train_log = (
wine_train
.transform_columns(skewed_vars, np.log10)
.rename(new_names_map, axis=1)
)
Details: inspect data after log-transformation
Column names:
R code
colnames(py$wine_train_log)
 [1] "log_fixed_acidity" "log_volatile_acidity"
[3] "citric_acid" "log_residual_sugar"
[5] "log_chlorides" "log_free_sulfur_dioxide"
[7] "log_total_sulfur_dioxide" "density"
[9] "ph" "log_sulphates"
[11] "log_alcohol" "quality"
[13] "quality_gr"
Dataset:
R code
head(py$wine_train_log)
R code
glimpse(py$wine_train_log)
Rows: 1,279
Columns: 13
$ log_fixed_acidity <dbl> 0.8260748, 0.9344985, 1.0374265, 0.8864907, 0…
$ log_volatile_acidity <dbl> -0.318758763, -0.309803920, -0.275724130, 0.0…
$ citric_acid <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.2…
$ log_residual_sugar <dbl> 0.3222193, 0.3010300, 0.6627578, 0.3222193, 0…
$ log_chlorides <dbl> -1.1938200, -0.9586073, -0.9281180, -0.991399…
$ log_free_sulfur_dioxide <dbl> 1.2552725, 1.2787536, 1.0000000, 1.0413927, 1…
$ log_total_sulfur_dioxide <dbl> 1.531479, 2.123852, 1.230449, 1.505150, 1.568…
$ density <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, …
$ ph <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.2…
$ log_sulphates <dbl> -0.19382003, 0.29666519, -0.25181197, -0.3187…
$ log_alcohol <dbl> 0.9867717, 0.9912261, 1.0681859, 1.0000000, 0…
$ quality <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, …
$ quality_gr <ord> medium, medium, medium, medium, medium, mediu…
Details: plots and summaries of variables after log-transformation
R code
DescTools::Desc(
py$wine_train_log |> mutate(quality = ordered(quality)),
verbose = 3, digits = 3
)------------------------------------------------------------------------------
Describe mutate(py$wine_train_log, quality = ordered(quality)) (data.frame):
data frame: 1279 obs. of 13 variables
1279 complete cases (100.0%)
Nr ColName Class NAs Levels
1 log_fixed_acidity numeric .
2 log_volatile_acidity numeric .
3 citric_acid numeric .
4 log_residual_sugar numeric .
5 log_chlorides numeric .
6 log_free_sulfur_dioxide numeric .
7 log_total_sulfur_dioxide numeric .
8 density numeric .
9 ph numeric .
10 log_sulphates numeric .
11 log_alcohol numeric .
12 quality ordered, factor . (6): 1-3, 2-4, 3-5,
4-6, 5-7, ...
13 quality_gr ordered, factor . (3): 1-low, 2-medium,
3-high
------------------------------------------------------------------------------
1 - log_fixed_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 94 0 0.910 0.906
100.0% 0.0% 0.0% 0.915
.05 .10 .25 median .75 .90 .95
0.785 0.813 0.851 0.898 0.964 1.025 1.069
range sd vcoef mad IQR skew kurt
0.539 0.087 0.095 0.078 0.113 0.425 0.131
lowest : 0.663, 0.690, 0.699 (4), 0.708 (3), 0.716 (6)
highest: 1.155, 1.176 (2), 1.190, 1.193 (2), 1.201
' 95%-CI (classic)
------------------------------------------------------------------------------
2 - log_volatile_acidity (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 1 -0.301 -0.310
100.0% 0.0% 0.1% -0.292
.05 .10 .25 median .75 .90 .95
-0.569 -0.509 -0.398 -0.284 -0.194 -0.119 -0.071
range sd vcoef mad IQR skew kurt
1.119 0.156 -0.517 0.153 0.204 -0.433 0.364
lowest : -0.921 (3), -0.796 (2), -0.745 (8), -0.721, -0.699 (2)
highest: 0.072, 0.074, 0.093, 0.124 (2), 0.199
' 95%-CI (classic)
------------------------------------------------------------------------------
3 - citric_acid (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 78 110 0.268 0.257
100.0% 0.0% 8.6% 0.278
.05 .10 .25 median .75 .90 .95
0.000 0.010 0.090 0.250 0.420 0.530 0.600
range sd vcoef mad IQR skew kurt
0.790 0.195 0.728 0.252 0.330 0.322 -0.882
lowest : 0.000 (110), 0.010 (27), 0.020 (40), 0.030 (21), 0.040 (25)
highest: 0.740 (4), 0.750, 0.760 (2), 0.780, 0.790
heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)
' 95%-CI (classic)
------------------------------------------------------------------------------
4 - log_residual_sugar (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 85 0 0.369 0.361
100.0% 0.0% 0.0% 0.378
.05 .10 .25 median .75 .90 .95
0.176 0.230 0.279 0.342 0.415 0.558 0.716
range sd vcoef mad IQR skew kurt
1.233 0.158 0.428 0.094 0.136 1.799 4.740
lowest : -0.046, 0.079 (7), 0.114 (5), 0.146 (33), 0.176 (22)
highest: 1.111, 1.127, 1.140 (2), 1.143, 1.188 (2)
heap(?): remarkable frequency (9.9%) for the mode(s) (= 0.301029995663981)
' 95%-CI (classic)
------------------------------------------------------------------------------
5 - log_chlorides (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 141 0 -1.090 -1.098
100.0% 0.0% 0.0% -1.083
.05 .10 .25 median .75 .90 .95
-1.268 -1.222 -1.155 -1.102 -1.046 -0.974 -0.906
range sd vcoef mad IQR skew kurt
1.587 0.141 -0.129 0.078 0.109 1.527 8.730
lowest : -1.921 (2), -1.469, -1.420 (2), -1.409 (4), -1.387 (4)
highest: -0.384, -0.383 (2), -0.382 (2), -0.375, -0.333
' 95%-CI (classic)
------------------------------------------------------------------------------
6 - log_free_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 58 2 1.098 1.082
100.0% 0.0% 0.2% 1.115
.05 .10 .25 median .75 .90 .95
0.602 0.699 0.845 1.114 1.322 1.477 1.544
range sd vcoef mad IQR skew kurt
1.857 0.301 0.274 0.339 0.477 -0.172 -0.566
lowest : 0.000 (2), 0.477 (44), 0.602 (34), 0.699 (86), 0.740
highest: 1.732, 1.740 (2), 1.820, 1.833 (2), 1.857
heap(?): remarkable frequency (9.1%) for the mode(s) (= 0.778151250383644)
' 95%-CI (classic)
------------------------------------------------------------------------------
7 - log_total_sulfur_dioxide (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 140 0 1.553 1.536
100.0% 0.0% 0.0% 1.570
.05 .10 .25 median .75 .90 .95
1.041 1.146 1.322 1.568 1.785 1.973 2.049
range sd vcoef mad IQR skew kurt
1.683 0.310 0.199 0.335 0.463 -0.033 -0.693
lowest : 0.778 (3), 0.845 (4), 0.903 (10), 0.954 (13), 1.000 (24)
highest: 2.190, 2.204, 2.217, 2.444, 2.461
' 95%-CI (classic)
------------------------------------------------------------------------------
8 - density (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 400 0 0.997 0.997
100.0% 0.0% 0.0% 0.997
.05 .10 .25 median .75 .90 .95
0.994 0.994 0.996 0.997 0.998 0.999 1.000
range sd vcoef mad IQR skew kurt
0.014 0.002 0.002 0.002 0.002 0.127 0.867
lowest : 0.990, 0.991 (2), 0.991, 0.991, 0.992
highest: 1.003, 1.003, 1.003 (2), 1.003, 1.004 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
9 - ph (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 86 0 3.312 3.304
100.0% 0.0% 0.0% 3.321
.05 .10 .25 median .75 .90 .95
3.070 3.130 3.210 3.310 3.400 3.510 3.570
range sd vcoef mad IQR skew kurt
1.150 0.153 0.046 0.148 0.190 0.254 0.881
lowest : 2.860, 2.870, 2.880 (2), 2.890 (2), 2.900
highest: 3.720 (3), 3.750, 3.780 (2), 3.900 (2), 4.010 (2)
' 95%-CI (classic)
------------------------------------------------------------------------------
10 - log_sulphates (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 88 1 -0.196 -0.201
100.0% 0.0% 0.1% -0.191
.05 .10 .25 median .75 .90 .95
-0.328 -0.310 -0.260 -0.208 -0.137 -0.076 -0.036
range sd vcoef mad IQR skew kurt
0.778 0.096 -0.487 0.089 0.123 0.850 1.854
lowest : -0.481, -0.432, -0.409 (5), -0.398 (4), -0.377 (4)
highest: 0.201, 0.207, 0.210, 0.290, 0.297
' 95%-CI (classic)
------------------------------------------------------------------------------
11 - log_alcohol (numeric)
length n NAs unique 0s mean meanCI'
1'279 1'279 0 61 0 1.016 1.013
100.0% 0.0% 0.0% 1.018
.05 .10 .25 median .75 .90 .95
0.964 0.968 0.978 1.009 1.045 1.079 1.097
range sd vcoef mad IQR skew kurt
0.249 0.043 0.042 0.046 0.068 0.667 -0.263
lowest : 0.924 (2), 0.929, 0.940, 0.944 (2), 0.954 (22)
highest: 1.127 (2), 1.130, 1.134, 1.146 (6), 1.173
heap(?): remarkable frequency (8.8%) for the mode(s) (= 0.977723605288848)
' 95%-CI (classic)
------------------------------------------------------------------------------
12 - quality (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 6 6 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 3 8 0.625% 8 0.625%
2 4 42 3.284% 50 3.909%
3 5 545 42.611% 595 46.521%
4 6 510 39.875% 1'105 86.396%
5 7 159 12.432% 1'264 98.827%
6 8 15 1.173% 1'279 100.000%
------------------------------------------------------------------------------
13 - quality_gr (ordered, factor)
length n NAs unique levels dupes
1'279 1'279 0 3 3 y
100.0% 0.0%
level freq perc cumfreq cumperc
1 low 50 3.909% 50 3.909%
2 medium 1'055 82.486% 1'105 86.396%
3 high 174 13.604% 1'279 100.000%
4 Analysis
4.1 Correlation: All Pairs
In this section, we will look into the relationships between each pair of numeric variables. We will start with graphical analysis (scatter plot matrices) and then proceed to linear correlation analysis.
Scatter plot matrices of non-transformed and log-transformed variables are presented in Fig. 1 and Fig. 2. In the plots we see that some variables are related (e.g., density and fixed acidity). It can also be noticed that after log-transformation the relationship between some variables (e.g., pH and fixed acidity) looks more linear. On the other hand, the plots are small and there are many points, so further investigation is needed to understand the data better.
Python code
fig_sns_1 = sns.pairplot(
wine_train,
kind="hist",
height=1,
corner=True,
);
for ax in fig_sns_1.axes.flatten():
if ax:
# rotate x axis labels
ax.set_xlabel(ax.get_xlabel(), rotation = 90);
# rotate y axis labels
ax.set_ylabel(ax.get_ylabel(), rotation = 0);
# set y labels alignment
ax.yaxis.get_label().set_horizontalalignment('right');
fig_sns_1.tight_layout();
plt.show()Python code
fig_sns_2 = sns.pairplot(
wine_train_log,
kind="hist",
height=1,
corner=True,
);
for ax in fig_sns_2.axes.flatten():
if ax:
# rotate x axis labels
ax.set_xlabel(ax.get_xlabel(), rotation = 90);
# rotate y axis labels
ax.set_ylabel(ax.get_ylabel(), rotation = 0);
# set y labels alignment
ax.yaxis.get_label().set_horizontalalignment('right');
fig_sns_2.tight_layout();
plt.show()
Next, the influence of the logarithmic transformation on the linear correlation coefficients is analyzed and summarized in the table below. A negative value of r_abs_difference (the difference in absolute values of the correlation coefficient) means that the correlation weakened; a positive value shows an increase in correlation strength.
Python code
# Calculate correlation
corr_df_orig = pairwise_correlation(wine_train)
corr_df_log = pairwise_correlation(wine_train_log)
# Investigate the changes in coefficients due to transformation
(
pd.concat([
corr_df_orig.r.rename("r_orig_var"),
corr_df_log.r.rename("r_log_var")
], axis=1
)
.eval("r_abs_difference = abs(r_log_var) - abs(r_orig_var)")
.select_columns("r_abs_difference")
.agg(["min", "mean", "std", "max"])
) r_abs_difference
min -0.114
mean 0.016
std 0.042
max 0.146
The average shift in the correlation coefficient was only slight: some coefficients started showing stronger correlation, others weaker. The cases with the largest positive and negative changes (indicated by the min and max values of r_abs_difference) are analyzed in the details sections. To sum up, it seems that in both cases the coefficients reflected the true correlation better after the log-transformation.
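`pairwise_correlation` is a project helper function. As a rough sketch of what such a helper might do (the real one also reports 95% confidence intervals), using only pandas and SciPy:

```python
import itertools

import pandas as pd
from scipy import stats

def pairwise_correlation_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation for every pair of numeric columns,
    with Bonferroni-adjusted p-values."""
    cols = df.select_dtypes("number").columns
    rows = []
    for x, y in itertools.combinations(cols, 2):
        r, p = stats.pearsonr(df[x], df[y])
        rows.append({"X": x, "Y": y, "r": r, "p": p})
    res = pd.DataFrame(rows)
    res["p_adj"] = (res["p"] * len(res)).clip(upper=1)  # Bonferroni correction
    return res
```

On the wine data this would be called as `pairwise_correlation_sketch(wine_train)`.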
Details: case study of chlorides vs. sulphates
After log-transformation, the linear correlation strength between chlorides and sulphates decreased most evidently (the change in absolute value is -0.11). It seems that the size of the linear correlation coefficient between these two variables is driven by 17 outlying points (where chlorides > 0.3, red dashed line in the plot below). The logarithmic transformation reduces this issue, and the true correlation strength is reflected better in this case. On the other hand, the rank correlation coefficient suggests that the correlation is absent or very weak.
R code
ggplot(py$wine_train, aes(x = chlorides, y = sulphates)) +
geom_point(alpha = 0.1, pch = 19) +
geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
ggtitle("Variables on Original Scale") +
ggplot(py$wine_train, aes(x = log(chlorides), y = log(sulphates))) +
geom_point(alpha = 0.1, pch = 19) +
ggtitle("Variables on Log Scale")R code
py$wine_train |> count(chlorides > 0.3)R code
c(
"r_original_variables" = with(
py$wine_train, cor(chlorides, sulphates, method = "pearson")
),
"r_log_variables" = with(
py$wine_train,
cor(log(chlorides), log(sulphates), method = "pearson")
),
"r_without_17_outliers" = with(
py$wine_train |> filter(chlorides < 0.3),
cor(chlorides, sulphates, method = "pearson")
),
"r_spearman_all_data" = with(
py$wine_train, cor(chlorides, sulphates, method = "spearman")
)
) |> round(2) r_original_variables r_log_variables r_without_17_outliers
0.31 0.20 0.05
r_spearman_all_data
-0.01
Details: case study of chlorides vs. density
After log-transformation, the linear correlation between chlorides and density increased most evidently (the change in absolute value is +0.15). It seems that the size of the linear correlation coefficient was again influenced by the group of 17 outlying points (where chlorides > 0.3, red dashed line in the plot below). The logarithmic transformation reduced this issue: the correlation of the log-transformed variables is almost the same as that of the non-transformed variables with the 17 points removed (r = 0.36 vs. r = 0.34).
R code
ggplot(py$wine_train, aes(x = chlorides, y = density)) +
geom_point(alpha = 0.1, pch = 19) +
geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
ggtitle("Variables on Original Scale") +
ggplot(py$wine_train, aes(x = log(chlorides), y = density)) +
geom_point(alpha = 0.1, pch = 19) +
ggtitle("Log-Transformed Chloride")R code
py$wine_train |> count(chlorides > 0.3)R code
c(
"r_original_variables" = with(
py$wine_train, cor(chlorides, density, method = "pearson")
),
"r_log_variables" = with(
py$wine_train,
cor(log(chlorides), density, method = "pearson")
),
"r_without_17_outliers" = with(
py$wine_train |> filter(chlorides < 0.3),
cor(chlorides, density, method = "pearson")
),
"r_spearman_all_data" = with(
py$wine_train, cor(chlorides, density, method = "spearman")
)
) |> round(2) r_original_variables r_log_variables r_without_17_outliers
0.21 0.36 0.34
r_spearman_all_data
0.41
Fig. 3 contains a graphical representation of all pair-wise correlation coefficients. The “details” sections below contain numeric results, while the two tables below them list correlations with the quality score and with the alcohol content.
R code
py$wine_train_log |>
select_if(is.numeric) |>
ggstatsplot::ggcorrmat() +
scale_y_discrete(limits = rev)
Details: correlation table (variables on original scale)
We can suspect that the linear correlation coefficients do not correctly reflect the true strength of association between the skewed variables.
R code
py$corr_df_orig |>
rowwise() |>
mutate(
`CI95%_lower` = `CI95%`[[1]],
`CI95%_upper` = `CI95%`[[2]],
r = round(r, digits = 2)
) |>
select(-`CI95%`) |>
ungroup() |>
rownames_to_column("index")
The results of the correlation analysis as a correlation matrix. All variables are on the original scale.
Details: correlation table (after log-transformation)
R code
py$corr_df_log |>
rowwise() |>
mutate(
`CI95%_lower` = `CI95%`[[1]],
`CI95%_upper` = `CI95%`[[2]],
r = round(r, digits = 2)
) |>
select(-`CI95%`) |>
ungroup() |>
rownames_to_column("index")
The results of the correlation analysis after the right-skewed variables were log-transformed.
Results. The strongest correlations are detected between log_free_sulfur_dioxide and log_total_sulfur_dioxide (r = 0.79, 95% CI [0.77, 0.81], p_adj < 0.001) and between log_fixed_acidity and ph (r = -0.71, 95% CI [-0.74, -0.69], p_adj < 0.001). In statistical modeling these variables will be used as explanatory variables, which means that a multicollinearity problem may appear in linear modeling. So we have to pay attention to this while interpreting the regression coefficients.
4.2 Exploratory PCA
4.2.1 Model Diagnostics
Principal component analysis (PCA) will be used to explore the relationships between the targets (wine quality score, quality group, and alcohol content) and the other variables.
R code
pca_model <- prcomp(py$wine_train_log |> select_if(is.numeric), scale. = TRUE)R code
fviz_eig(pca_model, addlabels = TRUE) + ylim(NA, 29)
Details: table with eigenvalues and percentage of explained variance
R code
get_eigenvalue(pca_model) |> round(1)
- To explain more than 80 % of variance, 5 components are required.
- The “elbow” method suggests 3-4 components.
- The eigenvalue-above-1 rule suggests 4-5 components.
These numbers might be important for feature selection. For visual inspection, the first 3 principal components, which account for 26.8 %, 20.1 %, and 14.6 % of the variance respectively, will be used.
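The rules above can be computed directly from the eigenvalues. Here is a sketch with scikit-learn on synthetic standardized data; with the real data, `X` would be the numeric part of `wine_train_log`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(251)
X = StandardScaler().fit_transform(rng.normal(size=(200, 12)))

pca = PCA().fit(X)
eigenvalues = pca.explained_variance_
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains more than 80 % of variance
n_for_80pct = int(np.argmax(cum_var > 0.80)) + 1
# Eigenvalue-above-1 rule
n_kaiser = int(np.sum(eigenvalues > 1))
print(n_for_80pct, n_kaiser)
```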
4.2.2 PCA Individuals Map
In the PCA “individuals map”, no very clear clusters are visible. It can only be noticed that, in the space of PC1 and PC2 (denoted as Dim1 and Dim2), roughly half of the point cloud is denser and the other half is sparser.
R code
fviz_pca_ind(pca_model, label = FALSE, alpha = 0.2)R code
fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.2)
After adding colors to the points, 3 partly overlapping clusters become visible. The direction of the cluster centroids matches the order of quality groups from lowest to highest. It seems that all first 3 PCs influence cluster separability.
R code
legend_txt = "Quality"
fviz_pca_ind(pca_model, label = FALSE, alpha = 0.3,
habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
palette = "Dark2") +
labs(color = legend_txt, fill = legend_txt, shape = legend_txt)R code
fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.3,
habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
palette = "Dark2") +
labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
4.2.3 PCA Correlation Circle Plot
The purpose of the correlation circle plot is to investigate relationships between variables. The angle between arrows shows the strength of the relationship in the analyzed PCA space, while the length of an arrow shows how well the variable is represented in that space.
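For reference: on standardized data, the arrow coordinates in a correlation circle are the correlations between the variables and the principal components, i.e., the eigenvectors scaled by the square roots of the eigenvalues. A scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(251)
X = rng.normal(size=(300, 5))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=300)  # make one correlated pair

Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)

# Variable coordinates: eigenvector * sqrt(eigenvalue) = corr(variable, PC)
coords = pca.components_.T * np.sqrt(pca.explained_variance_)

# Over all components, each variable's squared coordinates sum to ~1;
# the arrow length in the retained PCs shows how well it is represented.
print((coords**2).sum(axis=1).round(3))
```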
The variables of interest are quality score and fraction of alcohol content. These variables will be highlighted.
R code
highlight_vars <- recode_factor(
names(py$wine_train_log |> select_if(is.numeric)),
"quality" = "quality",
"log_alcohol" = "log_alcohol",
.default = "other"
)R code
fviz_pca_var(pca_model,
axes = c(1, 2), repel = TRUE, alpha = 0.7,
col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
theme(legend.position = "none")R code
fviz_pca_var(pca_model,
axes = c(3, 2), repel = TRUE, alpha = 0.7,
col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
theme(legend.position = "none")
4.2.4 PCA Biplot
The purpose of PCA biplots is to investigate the relationship between patterns in points (e.g., clusters) and variables.
R code
fviz_pca_biplot(pca_model,
axes = c(1, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
col.var = "black",
# col.var = highlight_vars,
label = "var",
addEllipses = TRUE,
habillage = py$wine_train_log$quality_gr,
palette = "Dark2"
) +
labs(color = legend_txt, fill = legend_txt, shape = legend_txt)R code
fviz_pca_biplot(pca_model,
axes = c(3, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
col.var = "black",
# col.var = highlight_vars,
label = "var",
addEllipses = TRUE,
habillage = py$wine_train_log$quality_gr,
palette = "Dark2"
) +
labs(color = legend_txt, fill = legend_txt, shape = legend_txt)
4.3 Modelling Wine Quality
4.3.1 Correlation
Correlation strength between wine quality score and the remaining variables:
Python code
corr_df_log.query("X=='quality' | Y == 'quality'") X Y r CI95% p_adj
65 log_alcohol quality 0.447 [0.4, 0.49] <0.001
20 log_volatile_acidity quality -0.407 [-0.45, -0.36] <0.001
64 log_sulphates quality 0.345 [0.3, 0.39] <0.001
29 citric_acid quality 0.247 [0.19, 0.3] <0.001
44 log_chlorides quality -0.169 [-0.22, -0.12] <0.001
55 log_total_sulfur_dioxide quality -0.162 [-0.22, -0.11] <0.001
59 density quality -0.152 [-0.21, -0.1] <0.001
10 log_fixed_acidity quality 0.123 [0.07, 0.18] <0.001
62 ph quality -0.063 [-0.12, -0.01] 0.324
50 log_free_sulfur_dioxide quality -0.041 [-0.1, 0.01] >0.999
37 log_residual_sugar quality 0.016 [-0.04, 0.07] >0.999
4.4 Modelling Alcohol Content
The purpose of this section is to create a linear model that predicts the fraction of alcohol content in wine.
Quick pair-wise summary of how other variables relate to the fraction of alcohol content in wine is presented in the details section.
Details: plots and summaries of variables by alcohol content
Let’s investigate relationship between alcohol content and other variables.
R code
py$wine_train_log %>%
mutate(quality = ordered(quality)) %>%
DescTools::Desc(. ~ log_alcohol, data = ., verbose = 3)------------------------------------------------------------------------------
log_fixed_acidity ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.101
Spearman corr.: -0.077
Kendall corr. : -0.056
------------------------------------------------------------------------------
log_volatile_acidity ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.223
Spearman corr.: -0.214
Kendall corr. : -0.144
------------------------------------------------------------------------------
citric_acid ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.100
Spearman corr.: 0.083
Kendall corr. : 0.055
------------------------------------------------------------------------------
log_residual_sugar ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.069
Spearman corr.: 0.109
Kendall corr. : 0.076
------------------------------------------------------------------------------
log_chlorides ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.318
Spearman corr.: -0.306
Kendall corr. : -0.213
------------------------------------------------------------------------------
log_free_sulfur_dioxide ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.085
Spearman corr.: -0.080
Kendall corr. : -0.055
------------------------------------------------------------------------------
log_total_sulfur_dioxide ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.237
Spearman corr.: -0.253
Kendall corr. : -0.175
------------------------------------------------------------------------------
density ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : -0.489
Spearman corr.: -0.461
Kendall corr. : -0.329
------------------------------------------------------------------------------
ph ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.229
Spearman corr.: 0.204
Kendall corr. : 0.144
------------------------------------------------------------------------------
log_sulphates ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)
Pearson corr. : 0.152
Spearman corr.: 0.212
Kendall corr. : 0.148
------------------------------------------------------------------------------
quality ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6
3 4 5 6 7 8
mean 1.002 1.016 0.995 1.024 1.056 1.074
median 1.002 1.015 0.987 1.021 1.061 1.068
sd 0.038 0.040 0.031 0.042 0.037 0.044
IQR 0.041 0.066 0.035 0.065 0.051 0.058
n 8 42 545 510 159 15
np 0.625% 3.284% 42.611% 39.875% 12.432% 1.173%
NAs 0 0 0 0 0 0
0s 0 0 0 0 0 0
min 0.924 0.954 0.929 0.924 0.964 0.991
max 1.041 1.117 1.173 1.146 1.146 1.146
Q1 0.990 0.982 0.973 0.991 1.031 1.047
Q3 1.031 1.048 1.009 1.056 1.083 1.106
mad 0.032 0.048 0.020 0.047 0.040 0.048
skew -0.791 0.199 1.533 0.371 -0.200 -0.331
kurt -0.510 -0.833 3.497 -0.475 -0.572 -0.942
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16
Proportions of quality in the quantiles of log_alcohol:
Q1 Q2 Q3 Q4
3 0.3% 1.2% 1.0% 0.0%
4 2.1% 3.9% 3.7% 3.6%
5 68.2% 54.0% 31.7% 11.9%
6 28.2% 34.7% 49.3% 49.3%
7 1.2% 5.6% 13.7% 31.5%
8 0.0% 0.6% 0.7% 3.6%
------------------------------------------------------------------------------
quality_gr ~ log_alcohol (.)
Summary:
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3
low medium high
mean 1.014 1.009 1.057
median 1.011 1.000 1.061
sd 0.040 0.040 0.038
IQR 0.058 0.058 0.053
n 50 1'055 174
np 3.909% 82.486% 13.604%
NAs 0 0 0
0s 0 0 0
min 0.924 0.924 0.964
max 1.117 1.173 1.146
Q1 0.983 0.978 1.033
Q3 1.041 1.035 1.086
mad 0.044 0.038 0.040
skew 0.085 0.892 -0.177
kurt -0.520 0.362 -0.569
Kruskal-Wallis rank sum test:
Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16
Proportions of quality_gr in the quantiles of log_alcohol:
Q1 Q2 Q3 Q4
low 2.4% 5.0% 4.7% 3.6%
medium 96.5% 88.7% 81.0% 61.3%
high 1.2% 6.2% 14.3% 35.1%
4.4.1 Correlation
Correlation strength between alcohol concentration and the remaining variables:
Python code
corr_df_log.query("X=='log_alcohol' | Y == 'log_alcohol'") X Y r CI95% p_adj
58 density log_alcohol -0.489 [-0.53, -0.45] <0.001
65 log_alcohol quality 0.447 [0.4, 0.49] <0.001
43 log_chlorides log_alcohol -0.318 [-0.37, -0.27] <0.001
54 log_total_sulfur_dioxide log_alcohol -0.237 [-0.29, -0.19] <0.001
61 ph log_alcohol 0.229 [0.18, 0.28] <0.001
19 log_volatile_acidity log_alcohol -0.223 [-0.27, -0.17] <0.001
63 log_sulphates log_alcohol 0.152 [0.1, 0.21] <0.001
9 log_fixed_acidity log_alcohol -0.101 [-0.15, -0.05] 0.007
28 citric_acid log_alcohol 0.100 [0.05, 0.15] 0.007
49 log_free_sulfur_dioxide log_alcohol -0.085 [-0.14, -0.03] 0.039
36 log_residual_sugar log_alcohol 0.069 [0.01, 0.12] 0.191
4.4.2 Sequential Feature Selection: All Features
Let’s do sequential feature selection and investigate how many features might be enough.
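`do_sfs_linear_regression` and `show_sfs_results_lin_reg` are project helper functions (judging by the `get_metric_dict()` calls later, they likely wrap mlxtend's `SequentialFeatureSelector`). As a rough, self-contained illustration of the same idea, here is forward selection with scikit-learn's `SequentialFeatureSelector` on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(251)
X = rng.normal(size=(300, 6))
y = 2 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=300)  # 2 informative features

# Forward selection of 2 features with 5-fold CV
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

selected = np.flatnonzero(sfs.get_support())
cv_r2 = cross_val_score(LinearRegression(), X[:, selected], y, cv=5, scoring="r2")
print(selected, round(cv_r2.mean(), 3))
```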
Python code
formula_log_full = (
'log_alcohol ~ '
'log_total_sulfur_dioxide + log_chlorides + log_sulphates + '
'log_residual_sugar + log_free_sulfur_dioxide + '
'log_volatile_acidity + log_fixed_acidity + '
'density + citric_acid + ph + quality'
)Python code
formula_full = (
'alcohol ~ '
'total_sulfur_dioxide + chlorides + sulphates + '
'residual_sugar + free_sulfur_dioxide + '
'volatile_acidity + fixed_acidity + '
'density + citric_acid + ph + quality'
)Python code
np.random.seed(251)
sfs_res1 = do_sfs_linear_regression(formula_full, wine_train)
show_sfs_results_lin_reg(sfs_res1, "(No Log-Transformed Variables Included)");Forward Feature Selection with 5-fold CV
(No Log-Transformed Variables Included)
Selected variables:
[ n = 9, avg. R² = 0.672 ]
1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. volatile_acidity
5. fixed_acidity
6. density
7. citric_acid
8. ph
9. quality
Python code
np.random.seed(251)
sfs_res2 = do_sfs_linear_regression(formula_full, wine_train, False)
show_sfs_results_lin_reg(sfs_res2, "(No Log-Transformed Variables Included)");Backward Feature Selection with 5-fold CV
(No Log-Transformed Variables Included)
Selected variables:
[ n = 7, avg. R² = 0.667 ]
1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. fixed_acidity
5. density
6. ph
7. quality
Python code
np.random.seed(251)
sfs_res3 = do_sfs_linear_regression(formula_log_full, wine_train_log)
show_sfs_results_lin_reg(sfs_res3, "(Log-Transformed Variables Included)");Forward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 8, avg. R² = 0.690 ]
1. log_sulphates
2. log_residual_sugar
3. log_free_sulfur_dioxide
4. log_fixed_acidity
5. density
6. citric_acid
7. ph
8. quality
Python code
sfs_res4 = do_sfs_linear_regression(formula_log_full, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4, "(Log-Transformed Variables Included)");Backward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 6, avg. R² = 0.685 ]
1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
6. quality
Although the algorithm suggests more, 5-6 variables would be enough, as the additional features give just a slight improvement. Log-transformed features give slightly better performance than the features on the original scale.
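`get_sfs_performance` is a project helper; judging from its output, it summarizes the cross-validation R² scores stored under a given subset size in the selector's metric dictionary (mlxtend's `get_metric_dict()` format). A minimal sketch under that assumption, which, unlike the real helper, takes the dictionary directly rather than the fitted selector:

```python
import numpy as np

def get_sfs_performance_sketch(metric_dict: dict, n_features: int) -> dict:
    """Summarize CV R² scores for a given number of selected features
    (assumes the structure returned by mlxtend's get_metric_dict())."""
    scores = np.asarray(metric_dict[n_features]["cv_scores"], dtype=float)
    return {
        "n_features": n_features,
        "mean R²": round(float(scores.mean()), 3),
        "median R²": round(float(np.median(scores)), 3),
        "sd R²": round(float(scores.std(ddof=1)), 3),
    }

# Usage with a hand-made metric dictionary:
demo = {6: {"cv_scores": [0.66, 0.69, 0.70, 0.67, 0.71]}}
print(get_sfs_performance_sketch(demo, 6))
```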
Python code
n_features = 6
print("No log-transformed features")No log-transformed features
Python code
get_sfs_performance(sfs_res2, n_features){'n_features': 6, 'mean R²': 0.664, 'median R²': 0.661, 'sd R²': 0.014}
Python code
sfs_res2.get_metric_dict()[n_features]['feature_names']('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph', 'quality')
Python code
print("With log-transformed features")With log-transformed features
Python code
get_sfs_performance(sfs_res4, n_features){'n_features': 6, 'mean R²': 0.685, 'median R²': 0.667, 'sd R²': 0.031}
Python code
selected_features_1 = sfs_res4.get_metric_dict()[n_features]['feature_names']
selected_features_1('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph', 'quality')
Python code
formula_selected_1 = "log_alcohol ~ " + " + ".join(selected_features_1)
4.4.3 Sequential Feature Selection: No Quality
Quality score seems to be the most expensive feature and might not be available at prediction time. So let’s remove it from the pool of candidate features.
Python code
formula_log_q = (
'log_alcohol ~ '
'log_total_sulfur_dioxide + log_chlorides + log_sulphates + '
'log_residual_sugar + log_free_sulfur_dioxide + '
'log_volatile_acidity + log_fixed_acidity + '
'density + citric_acid + ph'
)Python code
formula_q = (
'alcohol ~ '
'total_sulfur_dioxide + chlorides + sulphates + '
'residual_sugar + free_sulfur_dioxide + '
'volatile_acidity + fixed_acidity + '
'density + citric_acid + ph'
)Python code
np.random.seed(251)
sfs_res1q = do_sfs_linear_regression(formula_q, wine_train)
show_sfs_results_lin_reg(sfs_res1q, "(No Log-Transformed Variables Included)");Forward Feature Selection with 5-fold CV
(No Log-Transformed Variables Included)
Selected variables:
[ n = 7, avg. R² = 0.655 ]
1. total_sulfur_dioxide
2. sulphates
3. residual_sugar
4. fixed_acidity
5. density
6. citric_acid
7. ph
Python code
np.random.seed(251)
sfs_res2q = do_sfs_linear_regression(formula_q, wine_train, False)
show_sfs_results_lin_reg(sfs_res2q, "(No Log-Transformed Variables Included)");Backward Feature Selection with 5-fold CV
(No Log-Transformed Variables Included)
Selected variables:
[ n = 5, avg. R² = 0.647 ]
1. sulphates
2. residual_sugar
3. fixed_acidity
4. density
5. ph
Python code
np.random.seed(251)
sfs_res3q = do_sfs_linear_regression(formula_log_q, wine_train_log)
show_sfs_results_lin_reg(sfs_res3q, "(Log-Transformed Variables Included)");Forward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 7, avg. R² = 0.681 ]
1. log_total_sulfur_dioxide
2. log_sulphates
3. log_residual_sugar
4. log_fixed_acidity
5. density
6. citric_acid
7. ph
Python code
sfs_res4q = do_sfs_linear_regression(formula_log_q, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4q, "(Log-Transformed Variables Included)");Backward Feature Selection with 5-fold CV
(Log-Transformed Variables Included)
Selected variables:
[ n = 5, avg. R² = 0.673 ]
1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
Although the algorithm suggests using more features, 5 features seem sufficient, as the additional ones yield only a slight improvement. Again, the log-transformed features perform slightly better than the features on the original scale.
Python code
n_features_q = 5
print("No log-transformed features")
No log-transformed features
Python code
get_sfs_performance(sfs_res2q, n_features_q)
{'n_features': 5, 'mean R²': 0.647, 'median R²': 0.641, 'sd R²': 0.021}
Python code
sfs_res2q.get_metric_dict()[n_features_q]['feature_names']
('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph')
Python code
print("With log-transformed features")
With log-transformed features
Python code
get_sfs_performance(sfs_res4q, n_features_q)
{'n_features': 5, 'mean R²': 0.673, 'median R²': 0.666, 'sd R²': 0.034}
Python code
selected_features_2 = sfs_res4q.get_metric_dict()[n_features_q]['feature_names']
selected_features_2
('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph')
Python code
formula_selected_2 = "log_alcohol ~ " + " + ".join(selected_features_2)
4.4.4 Compare Null, Full and Selected Models
Create scaled data for feature importance.
Python code
wine_train_log_num = wine_train_log.select_dtypes(include="number")
scaler_log = StandardScaler().fit(wine_train_log_num)
df_log_scaled = scaler_log.transform(wine_train_log_num)
df_log_scaled = pd.DataFrame(df_log_scaled, columns=wine_train_log_num.columns)
Feature importance in the null model:
Python code
formula_log_null = 'log_alcohol ~ 1'
model_null = smf.ols(formula_log_null, df_log_scaled)
results_null = model_null.fit()
# print(results_null.summary())
# print(f"R² = {results_null.rsquared_adj:.3f}")
# print(" ")
# print(results_null.params.sort_values(key=abs, ascending=False))
Feature importance in the selected model (with quality score):
Python code
model_selected_1 = smf.ols(formula_selected_1, df_log_scaled)
results_selected_1 = model_selected_1.fit()
print(results_selected_1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_alcohol R-squared: 0.692
Model: OLS Adj. R-squared: 0.690
Method: Least Squares F-statistic: 475.4
Date: Sat, 04 Feb 2023 Prob (F-statistic): 1.23e-320
Time: 16:02:46 Log-Likelihood: -1062.6
No. Observations: 1279 AIC: 2139.
Df Residuals: 1272 BIC: 2175.
Df Model: 6
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept -2.557e-14 0.016 -1.64e-12 1.000 -0.031 0.031
log_sulphates 0.1652 0.017 9.613 0.000 0.132 0.199
log_residual_sugar 0.4324 0.018 24.160 0.000 0.397 0.467
log_fixed_acidity 0.9256 0.031 29.527 0.000 0.864 0.987
density -1.1012 0.027 -41.437 0.000 -1.153 -1.049
ph 0.5669 0.023 24.407 0.000 0.521 0.612
quality 0.1376 0.018 7.536 0.000 0.102 0.173
==============================================================================
Omnibus: 128.690 Durbin-Watson: 1.946
Prob(Omnibus): 0.000 Jarque-Bera (JB): 209.744
Skew: 0.703 Prob(JB): 2.85e-46
Kurtosis: 4.399 Cond. No. 4.15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_selected_1.rsquared_adj:.3f}")
R² = 0.690
Python code
print(" ")
Python code
print(results_selected_1.params.sort_values(key=abs, ascending=False))
density -1.101
log_fixed_acidity 0.926
ph 0.567
log_residual_sugar 0.432
log_sulphates 0.165
quality 0.138
Intercept -0.000
dtype: float64
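Ranking features by the absolute value of their coefficients is meaningful here only because the data were standardized first: on the raw scale, a coefficient shrinks simply because its feature varies on a wider scale. A small self-contained illustration with synthetic data (plain NumPy least squares, not the wine data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(0.0, 1.0, n)       # spread ~1
x2 = rng.normal(0.0, 10.0, n)      # same kind of signal, 10x the spread
y = 2.0 * x1 + 0.5 * x2 + rng.normal(0.0, 1.0, n)

def ols_slopes(X, y):
    """Slope coefficients of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

# On the raw scale x1 gets the larger coefficient (≈2.0 vs ≈0.5) ...
raw = ols_slopes(np.column_stack([x1, x2]), y)

# ... but per standard deviation x2 actually moves y more (0.5*10 vs 2.0*1),
# which the coefficients fitted on z-scored data reveal.
zscore = lambda v: (v - v.mean()) / v.std(ddof=1)
std = ols_slopes(np.column_stack([zscore(x1), zscore(x2)]), zscore(y))
```

Equivalently, a standardized slope can be obtained without refitting as beta_raw · sd(x) / sd(y).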
Feature importance in the selected model (without quality score):
Python code
model_selected_2 = smf.ols(formula_selected_2, df_log_scaled)
results_selected_2 = model_selected_2.fit()
print(results_selected_2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_alcohol R-squared: 0.678
Model: OLS Adj. R-squared: 0.677
Method: Least Squares F-statistic: 535.6
Date: Sat, 04 Feb 2023 Prob (F-statistic): 5.66e-310
Time: 16:02:47 Log-Likelihood: -1090.5
No. Observations: 1279 AIC: 2193.
Df Residuals: 1273 BIC: 2224.
Df Model: 5
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept -2.745e-14 0.016 -1.73e-12 1.000 -0.031 0.031
log_sulphates 0.2136 0.016 13.109 0.000 0.182 0.246
log_residual_sugar 0.4584 0.018 25.551 0.000 0.423 0.494
log_fixed_acidity 1.0022 0.030 33.073 0.000 0.943 1.062
density -1.1839 0.025 -47.875 0.000 -1.232 -1.135
ph 0.5924 0.023 25.232 0.000 0.546 0.638
==============================================================================
Omnibus: 128.278 Durbin-Watson: 1.961
Prob(Omnibus): 0.000 Jarque-Bera (JB): 212.311
Skew: 0.695 Prob(JB): 7.89e-47
Kurtosis: 4.432 Cond. No. 3.79
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_selected_2.rsquared_adj:.3f}")
R² = 0.677
Python code
print(" ")
Python code
print(results_selected_2.params.sort_values(key=abs, ascending=False))
density -1.184
log_fixed_acidity 1.002
ph 0.592
log_residual_sugar 0.458
log_sulphates 0.214
Intercept -0.000
dtype: float64
Feature importance in the full model:
Python code
model_full = smf.ols(formula_log_full, df_log_scaled)
results_full = model_full.fit()
print(results_full.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_alcohol R-squared: 0.704
Model: OLS Adj. R-squared: 0.701
Method: Least Squares F-statistic: 273.9
Date: Sat, 04 Feb 2023 Prob (F-statistic): 0.00
Time: 16:02:48 Log-Likelihood: -1036.3
No. Observations: 1279 AIC: 2097.
Df Residuals: 1267 BIC: 2158.
Df Model: 11
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept -2.481e-14 0.015 -1.62e-12 1.000 -0.030 0.030
log_total_sulfur_dioxide -0.0707 0.028 -2.534 0.011 -0.125 -0.016
log_chlorides -0.0495 0.018 -2.750 0.006 -0.085 -0.014
log_sulphates 0.1755 0.018 9.789 0.000 0.140 0.211
log_residual_sugar 0.4212 0.018 23.504 0.000 0.386 0.456
log_free_sulfur_dioxide -0.0049 0.027 -0.183 0.855 -0.057 0.047
log_volatile_acidity 0.1002 0.021 4.726 0.000 0.059 0.142
log_fixed_acidity 0.8079 0.037 21.860 0.000 0.735 0.880
density -1.0620 0.029 -37.148 0.000 -1.118 -1.006
citric_acid 0.1477 0.026 5.619 0.000 0.096 0.199
ph 0.5402 0.025 22.015 0.000 0.492 0.588
quality 0.1371 0.019 7.332 0.000 0.100 0.174
==============================================================================
Omnibus: 121.513 Durbin-Watson: 1.964
Prob(Omnibus): 0.000 Jarque-Bera (JB): 192.747
Skew: 0.681 Prob(JB): 1.40e-42
Kurtosis: 4.326 Cond. No. 5.84
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_full.rsquared_adj:.3f}")
R² = 0.701
Python code
print(" ")
Python code
print(results_full.params.sort_values(key=abs, ascending=False))
density -1.062
log_fixed_acidity 0.808
ph 0.540
log_residual_sugar 0.421
log_sulphates 0.176
citric_acid 0.148
quality 0.137
log_volatile_acidity 0.100
log_total_sulfur_dioxide -0.071
log_chlorides -0.050
log_free_sulfur_dioxide -0.005
Intercept -0.000
dtype: float64
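The models above are compared informally via adjusted R². For nested OLS models, a partial F-test offers a formal comparison (statsmodels exposes it as results.compare_f_test); a sketch of the computation on synthetic data, not the wine data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 * X[:, 0] + 0.7 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

def fit_rss(X, y):
    """Residual sum of squares and residual df of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return ((y - X1 @ beta) ** 2).sum(), len(y) - X1.shape[1]

rss_r, df_r = fit_rss(X[:, :2], y)   # restricted model: drops the 3rd feature
rss_f, df_f = fit_rss(X, y)          # full model

# Partial F-test: is the full model's improvement in fit more than chance?
f_stat = ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
p_value = stats.f.sf(f_stat, df_r - df_f, df_f)
```

A small p-value says the dropped features carry real signal; a large one supports the simpler model.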
4.4.5 Model Diagnostics
Python code
lm = results_selected_2
Python code
(results_selected_2.get_influence().cooks_distance[0] > 1).any()
False
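The check above uses the common rule of thumb that a Cook's distance greater than 1 flags an influential observation. To show what the statistic measures, a self-contained NumPy sketch with one planted outlier (synthetic data; statsmodels computes the same quantity via get_influence()):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.3, size=n)
x[0], y[0] = 5.0, -10.0                    # plant one influential outlier

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]                             # number of parameters
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages (hat-matrix diagonal)
s2 = resid @ resid / (n - p)               # residual variance estimate

# Cook's distance: how much all fitted values would shift if point i were dropped.
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2
```

The planted point combines high leverage with a large residual, so its Cook's distance is far above 1, while the regular points stay well below.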
Python code
plt.clf()
fig = sm.graphics.plot_partregress_grid(lm);
Python code
fig.tight_layout();
plt.show()
Python code
plt.clf()
fig = sm.graphics.plot_ccpr_grid(lm)
fig.tight_layout(pad=1.0)
plt.show()
Python code
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(lm.resid, lm.model.exog)
[*zip(name, test)]
[('Lagrange multiplier statistic', 90.84904277133138), ('p-value', 4.455957158445557e-18), ('f-value', 19.467363257890643), ('f p-value', 1.0150662334875106e-18)]
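The Breusch-Pagan test rejects homoscedasticity decisively (p ≈ 4·10⁻¹⁸), so the reported standard errors should be treated with caution; heteroscedasticity-robust (e.g., HC3) standard errors would be a natural remedy. The test itself is simple: regress the squared residuals on the regressors and take n·R² of that auxiliary regression, which is asymptotically χ² under the null. A self-contained sketch on synthetic heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1.0, 10.0, n)
y = 2.0 * x + rng.normal(scale=x)          # error spread grows with x

# Fit the main regression and keep its residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan: auxiliary regression of squared residuals on the regressors;
# LM = n * R²_aux is ~ chi²(k) under homoscedasticity (here k = 1 regressor).
u = resid**2
g, *_ = np.linalg.lstsq(X, u, rcond=None)
r2_aux = 1 - ((u - X @ g) ** 2).sum() / ((u - u.mean()) ** 2).sum()
lm_stat = n * r2_aux
```

statsmodels' sms.het_breuschpagan used above returns this LM statistic with its p-value, plus an F-test variant of the same idea.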
5 Looker Studio Dashboard
Part of this project was to create a Looker Studio dashboard for exploring selected properties of red wine. A screenshot of the dashboard and a link to it are provided below.
Link to the dashboard.
6 Limitations and Suggestions on Improvement
- The code would be easier to maintain if it were written in a single language (either R or Python): a bilingual codebase requires an expert who knows both languages.